Abstract: We show a high-speed hardware implementation of
x mod z that can be pipelined in O(n − m) stages, where x is
represented in n bits and z is represented in m bits. It is suitable
for large x. We offer two versions. In the first, the value of z
is fixed by the hardware. For example, using this circuit, we
show a random number generator that produces more than 11
million random numbers per second on the SRC-6 reconfigurable
computer. In the second, z is an independent input. This is
suitable for residue number system (RNS) applications, for example.
The second version can be pipelined in O(n) stages.

Keywords: x mod z computation, high-speed modulo reduction,
mod z arithmetic


I.  Introduction 

The need for cryptographically secure systems has inspired
interest in the computation of x mod z, especially when x
and z are large [6]. The x mod z function is also useful
in producing (pseudo) random numbers. Random number
generators based on linear recurrences use, for example,
$x_{n+1} = (P_1 x_n + P_2) \bmod N$, where $x_0$ is the seed value
[14]. For example, Lehmer's algorithm, where the i-th random
number is $s_i = a s_{i-1} \bmod p$, is fast enough for many
simulation applications [10]. The Blum-Blum-Shub algorithm,
where a bit of the random number is chosen from
$s_i = s_{i-1}^2 \bmod pq$, such that p and q are prime, is believed
to be as secure as encryption methods based on factorization [5].

Another application is the residue number system (RNS),
where addition, subtraction, and multiplication are done without
carries [13]. In this application, the x mod z operation is
used in the binary-to-RNS conversion, the RNS operations,
and the RNS-to-binary conversion. In an RNS application,
one seeks to compute simultaneously x mod z for one value
of x and several values of z. Radix converters have been
proposed that use an LUT cascade [8]. Another application
is in primality testing. For example, the famous polynomial-time
algorithm [1] for determining if z is prime must compute
x mod z. Efficient randomized primality algorithms must also
compute x mod z, e.g. [2]. In this paper, we show high-speed
compact hardware realizations of x mod z suitable for
implementation on an FPGA.

Surprisingly, there is little work on x mod z. One study [9]
uses the sign estimate technique to estimate when the sign
of (x − qz) changes, where x = (qz + x) mod z. It shows
an algorithm for computing x mod z, but no experimental
results are shown. [7] discusses a unified method for computing
modular multiplication, but shows no experimental results.
Neither addresses the issue of whether an independent z can be
accommodated. We consider a circuit that computes x mod z
given x and z, as n- and m-bit numbers, respectively. It can be
pipelined, and each stage consists of adders and multiplexors,
so the delay is small and the throughput is high. When
multiple pipelines are used, as in the case of RNS applications,
the pipelines can be designed to have the same length, so that
the residues of each number arrive simultaneously. We show
two architectures. In the first, z is fixed by the hardware. In the
second, z can be changed at each clock period. Experimental
results demonstrate the efficiency of our design.

II.  Basic  Implementation 

Our  design  benefits  from  the  following  viewpoint:  We 
consider  the  computation  of  x  mod  z  as  a  modulo  reduction 
process,  where,  at  each  stage,  the  magnitude  of  x  is  reduced, 
but  the  residue  remains  the  same.  This  continues  until  only 
the  residue  remains. 


A.  Computing  x  mod  z  for  fixed  z 

Fig. 1 shows a circuit that realizes this process for the case
where x has 8 bits and z = 3. First, 192 is subtracted, if
possible. Next, 96 is subtracted, if possible, etc. This circuit
performs a sequence of subtract operations until only the
residue OUT remains. That is, IN = x = OUT + 3Q, where
OUT is a 2-bit remainder upon division of IN by 3 (OUT
= 0, 1, or 2). Representing Q, the quotient, as a 7-bit binary
number $q_6 2^6 + \dots + q_1 2^1 + q_0 2^0$ yields

$$0 \le IN = x = OUT + 3 q_6 2^6 + \dots + 3 q_1 2^1 + 3 q_0 2^0 < 256, \qquad (1)$$

where the limits 0 ≤ and < 256 are imposed by our
specification that IN = x is represented as an 8-bit standard
binary number.
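The subtract-if-possible chain described above can be modeled in software. The following Python sketch is a behavioral model only, not the authors' hardware description; the function name `mod_fixed_z` and its parameters are assumptions introduced for illustration. Each loop iteration stands for one stage: a comparator against the constant z·2^i, followed by a subtractor whose result is selected when the comparison succeeds.

```python
def mod_fixed_z(x: int, z: int = 3, n: int = 8) -> int:
    """Behavioral model of a Fig. 1 style chain for an n-bit x and a fixed z.

    Hypothetical helper for illustration; one loop iteration = one stage.
    """
    assert 0 <= x < (1 << n) and z >= 1
    m = z.bit_length()                 # number of bits needed to represent z
    # Stage constants z * 2^i, largest first (192, 96, ..., 3 for z = 3, n = 8).
    for i in range(n - m, -1, -1):
        theta = z << i
        if x >= theta:                 # comparator decides whether to subtract
            x -= theta                 # magnitude shrinks, residue is unchanged
    return x                           # only the residue remains


if __name__ == "__main__":
    # Exhaustive check against Python's built-in remainder for 8-bit inputs.
    assert all(mod_fixed_z(x) == x % 3 for x in range(256))
    print(mod_fixed_z(200))
```

For z = 3 and n = 8 the loop runs seven times, which corresponds to the seven stages with constants 192 down to 3.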


Fig. 1. Computation of x mod z, where x has 8 bits and z = 3 (z is fixed
by hardware).


The circuit shown in Fig. 1 is combinational. When n
is small, such a circuit is satisfactory. However, when n is
large, the delay will be too large, and it is necessary that it
be pipelined. Table I shows the frequency, number of LUTs,
and the number of register bits used in the Altera Stratix IV
EP4SE530F43C3NES FPGA to realize x mod 3, for x, an n-bit
number, where n = 8, 16, 32, 64, 128, and 256, such
that a pipeline register exists at the output of each stage. The
resulting circuit is compact and fast. For example, for n = 256
bits, only 13% of the packed ALMs available are used and the
frequency exceeds 100 MHz.

TABLE I
Resources on an Altera Stratix IV EP4SE530F43C3NES FPGA needed to
realize x mod z, where x is an n-bit number and z = 3 (z is fixed).

  n    Freq.     # of r-input LUTs              Est. # of       Total # of
       (MHz)     6-     5-     4-     3-        packed ALMs     registers
  8    573.5      0     14      7       15         29 (0%)         50 (0%)
  16   498.1      5     12     17      124        134 (0%)        226 (0%)
  32   422.0     29     35     71      501        565 (0%)        962 (0%)
  64   276.2      0      0      0    3,513      2,117 (0%)      2,738 (0%)
  128  209.7      0      0      0   13,764      7,269 (3%)      8,902 (2%)
  256  143.8      0      0      0   55,001     27,955 (13%)    33,549 (7%)

Note that, in this circuit, z is fixed by the architecture.
For random number generation and cryptographic applications,
such a circuit is adequate. In the next section, we show a circuit
where z can be changed.

B.  Computing  x  mod  z,  where  z  is  an  independent  variable 

In the circuit in Fig. 1, the value of z is determined by the
constant $\theta_i$ applied at each stage. Here, $\theta_i = 3 \cdot 2^i$, where i
is an index to the stage, such that i = 0 corresponds to the
rightmost stage and i = 6 corresponds to the leftmost stage.
In the case of general z, $\theta_i = z 2^i$.
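When z is an input rather than a constant, the stage constant z·2^i may exceed the n-bit range for the leftmost stages, and such a stage should leave x untouched. The Python sketch below is a behavioral model under that assumption; it does not reproduce the bus-level structure of Fig. 2, and the name `mod_variable_z` and the fixed stage count of n (enough for the worst case z = 1) are choices made here for illustration.

```python
def mod_variable_z(x: int, z: int, n: int = 8) -> int:
    """Behavioral model of x mod z where z is an independent n-bit input.

    Hypothetical helper for illustration; the stage count is fixed at n so
    that the smallest modulus, z = 1, is still handled.
    """
    assert 0 <= x < (1 << n) and 1 <= z < (1 << n)
    for i in range(n - 1, -1, -1):
        theta = z << i                 # shifted copy of z used by stage i
        if theta >> n:                 # shifted z overflows n bits:
            continue                   #   the stage acts as a pass-through
        if x >= theta:                 # otherwise compare and subtract
            x -= theta
    return x


if __name__ == "__main__":
    # Exhaustive check for all 8-bit x and all nonzero 8-bit z.
    assert all(mod_variable_z(x, z) == x % z
               for x in range(256) for z in range(1, 256))
    print(mod_variable_z(200, 7))
```

Skipping a stage whose constant overflows is safe in this model because x is always smaller than 2^n, so no subtraction could occur there anyway.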


Fig. 2. Computation of x mod z (z is an independent input).


We can modify the circuit to accommodate general z, as
shown in Fig. 2. However, to accommodate various z, we have to

1) include enough stages to accommodate the worst case

2) allow unneeded stages to "pass-through"

For example, in the circuit in Fig. 1, which realizes
x mod 3, 7 stages are needed. However, to realize x mod 6,
x mod 12, ..., and x mod 192, we need 6, 5, ..., and 1 stage,
respectively. Therefore, to accommodate any z, 7 stages are
needed.

The pass-through operation can be implemented by testing
the shifted value at the previous stage, as shown on the left of
Fig. 2. Here, each stage has two inputs/outputs, the reduced
x value (top, single bus), and a shifted version of z (bottom,
double bus). The reduced x value is the same as in the
single-rail circuit shown in Fig. 1. The shifted version of z is the
replacement of the value $\theta_i$ in each stage in the circuit of
Fig. 1. At the leftmost stage, z is placed on the top half of
the double line showing two n-bit buses. As it passes through
each stage, it is shifted down once. If there is at least one 1 bit
in the upper n bits of the double bus, the NOR gate produces
a 0, which, when applied to the AND gate, yields a 0 at the
MUX input. This causes the MUX to deliver a 0 to the A-B
gate, and the stage is a pass-through stage. Note that the two
left stages are rendered unnecessary by the observation that
z > 1. That is, for any value of z > 1, at least two right shifts
are needed to yield all 0's in the shifted result. Therefore, we
can eliminate the two leftmost stages and apply the double
bus shifted twice to the next stage. On the left of Fig. 2, z,
which is 3 in this example, is shifted twice right (down). In
this case, the top 8-bit bus has all 0's and so the leftmost
stage will not be a pass-through. Instead, the value of x will
be tested against $192_{10} = 1100\,0000_2$, and if it is equal to or
larger than this value, 192 is subtracted from x and passed on
to the stage at the right. If x is less than 192, x is passed on
to the stage on the right.

Table II shows the frequency, number of LUTs, and
the number of register bits used in an Altera Stratix IV
EP4SE530F43C3NES FPGA to realize x mod z, where x is
an n-bit number and n = 8, 16, 32, 64, 128, and 256. A
pipeline register exists at the output of each stage. Note that
the frequency is identical or nearly the same as with the circuit
in which z is fixed (Table I). As expected, more resources are
needed when z is an independent input. However, only slightly
more resources are needed. That is, the price of an independent
z is small.

TABLE II
Resources on an Altera Stratix IV EP4SE530F43C3NES FPGA needed to
realize x mod z, where x is an n-bit number and z is an independent
variable.

  n    Freq.     # of r-input LUTs              Est. # of       Total # of
       (MHz)     6-     5-     4-     3-        packed ALMs     registers
  8    632.6      0     18      9       16         32 (0%)         56 (0%)
  16   493.3      5     12     17      138        141 (0%)        240 (0%)
  32   422.0     29     35     71      531        580 (1%)        992 (0%)
  64   276.2      0      0      0    3,575      2,148 (1%)      2,800 (0%)
  128  209.7      0      0      0   13,890      7,332 (3%)      9,028 (2%)
  256  143.8      0      0      0   55,255     28,082 (13%)    33,803 (7%)

III. Advanced Implementation

A. Reducing complexity


These circuits have significant redundancy. For example, if
x has 1,024 bits, the pipelined value of x mod z requires 1,024
flip-flops per stage. If z is 3, then the simplest circuit shown in
Fig. 1 requires 1,023 stages or a total of more than 1,000,000
flip-flops. However, not all of these are needed. For example,
consider the rightmost stage in Fig. 1, which produces the
output x mod 3. Only two of the eight outputs need be driven
by flip-flops; all of the others can be driven by constant 0's or
simply omitted. Similarly, the cell to its left need only produce
three outputs driven by flip-flops. In all, 7 + 6 + 5 + 4 + 3
+ 2 = 27 of the 8 × 7 = 56 stage outputs shown need to be
driven by flip-flops.

Further savings can be obtained by observing that the
constant term has a simple form. At the leftmost stage, this
constant is $1100\,0000_2 = 192_{10}$. It is sufficient for the
comparator and subtractor to apply to three bits only.

Fig. 3 shows how these observations reduce the complexity
of the middle stage in the circuit of Fig. 1. While this form
of the circuit is desirable because the next step is to allow
for values of z different from 3, it is useful to observe that
this stage is simply realized by a 3-input 2-output function,
whose truth table is shown in Table III. For example, for the
first three rows, y < 011, and the right two bits of Stage Input
y pass to the output unchanged, as shown in the column
labeled "Internal Stage y mod 3". For the next three rows, y ≥
011 and y − 011 is passed to the output, as shown. The last two
rows show don't care values in the column labeled "Internal
Stage y mod 3". This is because the two input values y = 110
and 111 never occur. That is, the previous stage will reduce
these values to less than y = 110. This can also be seen by the
fact that $f_1 f_2$ in the column labeled "Internal Stage y mod 3"
in this table is never 11. However, if all three inputs of the left
stage are driven by inputs, the values y = 110 and 111 occur.
In this case, the output $f_1 f_2$ should be 00 and 01, respectively.
Because it affects one stage in the LUT cascade, the don't
care values will be chosen for all stages so that the left stage
produces the correct output. In an FPGA, LUTs easily realize
this function. Memory-based logic is appropriate as a means
to implement this circuit [16]. Fig. 4 shows the circuit of Fig.
1 as a cascade of LUTs.
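The stage function described above, and a cascade built from it, can be sketched in Python. This is an illustrative model that assumes each stage receives the 2-bit residue from its left neighbor plus one new bit of x; the names `LUT` and `mod3_lut_cascade` are hypothetical. The two don't-care entries (110 and 111) are filled with 00 and 01 so that the leftmost stage, which is fed by three raw input bits, is also correct.

```python
# 3-input, 2-output stage function: address y (0..7) maps to y mod 3.
# Rows 000..010 pass through, rows 011..101 output y - 011, and the
# don't-care rows 110 and 111 are assigned 00 and 01.
LUT = [y % 3 for y in range(8)]


def mod3_lut_cascade(x: int, n: int = 8) -> int:
    """Compute x mod 3 for an n-bit x with a cascade of identical LUTs."""
    assert n >= 3 and 0 <= x < (1 << n)
    r = LUT[x >> (n - 3)]              # leftmost stage: top three bits of x
    for i in range(n - 4, -1, -1):     # every later stage consumes one more bit
        bit = (x >> i) & 1
        r = LUT[(r << 1) | bit]        # 2-bit residue + 1 new bit = 3-bit address
    return r


if __name__ == "__main__":
    assert LUT == [0, 1, 2, 0, 1, 2, 0, 1]
    assert all(mod3_lut_cascade(x) == x % 3 for x in range(256))
    print(mod3_lut_cascade(255))
```

In this model an n-bit input uses n − 2 stages with two output bits each, which is consistent with the 3-input LUT counts reported in Table IV (12 for n = 8, 28 for n = 16).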


Fig. 3. Reduced complexity stage for the calculation of x mod 3.

TABLE III
Truth table of a simple implementation of x mod z.


Fig.  4.  Computation  of  x  mod  3  where  each  stage  contains  an  LUT. 

Table IV shows the frequency and resource usage for the
circuit shown in Fig. 4. It is clear that this circuit is much
faster and consumes fewer resources than the two previous
realizations. Like the circuit in Fig. 1, this new more compact
circuit computes x mod z, where z is determined by the
architecture.


B.  Reducing  latency 

C.  Tradeoff  between  complexity  and  latency 

TABLE IV
Resources on an Altera Stratix IV EP4SE530F43C3NES FPGA needed to
realize x mod z, where x is an n-bit number and z = 3 (z is fixed).

  n    Freq.     # of r-input LUTs              Est. # of       Total # of
       (MHz)     6-     5-     4-     3-        packed ALMs     registers
  8    1520.2     0      0      0       12         14 (0%)         27 (0%)
  16   1520.2     0      0      0       28         60 (0%)        119 (0%)
  32   1520.2     0      0      0       60        248 (1%)        495 (0%)
  64   1520.2     0      0      0      124      1,008 (0%)      2,015 (0%)
  128  1097.9     0      0      0      252      1,134 (0%)      2,268 (0%)
  256  1097.9     0      0      0      508      1,262 (0%)      2,524 (0%)

These examples show a tradeoff between the memory used
to store the functions and latency. In the case of a single
lookup table, a latency of one clock can be achieved, but the
memory is 512 bits. One can reduce this to one-fifth, but the
latency increases to eight clocks. We can make the following
observation.

Lemma 1. In the realization of x mod z where z is fixed, let n
and m be the number of bits to represent x and z, respectively.
Let S be the number of stages in the pipeline, and let a be
the speed-up compared to the full-latency system, where a =
1 (no speed-up), 2, 3, etc. Let M be the total memory in bits
needed in this realization. Then,

$$S = \left\lceil \frac{n-m}{a} \right\rceil \quad \text{and} \quad M = m 2^{m+a} \left\lceil \frac{n-m}{a} \right\rceil. \qquad (2)$$

Proof: Each stage has m + a inputs and m outputs. Collectively,
the stages have S(m + a) inputs, of which (S − 1)m
are driven by outputs from the stages. Thus,

$$S(m + a) - (S - 1)m \ge n \quad \text{and} \quad S = \left\lceil \frac{n-m}{a} \right\rceil.$$

The observation that each unit requires $m 2^{m+a}$ bits yields (2).


Lemma 1 is similar to Lemma 5.1.5 of [16]. It follows from
(2) that the smallest a (= 1) yields the smallest storage M.
If n − m is even, then a = 2 yields the same M, since the
$2^{m+a}$ term doubles, while the $\lceil (n-m)/a \rceil$ term halves. This can
also be concluded from Theorem 9.8.2 of [16], which applies
to general cascade circuits. However, as a increases beyond
2, the $2^{m+a}$ term dominates, and M increases rapidly. If n − m is
odd, this observation is approximately true. From this, we can
see that there is little penalty to choosing a = 2, and, from
this point on, reducing latency increases memory significantly.
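The tradeoff stated in Lemma 1 is easy to tabulate. The short Python sketch below evaluates the stage count S and the memory M for a range of speed-up values a; the function names `stages` and `memory_bits` are assumptions made for this illustration, and the example parameters n = 8, m = 2 correspond to the x mod 3 circuit with an 8-bit input.

```python
def stages(n: int, m: int, a: int) -> int:
    """S = ceil((n - m) / a), computed with integer arithmetic."""
    return -(-(n - m) // a)


def memory_bits(n: int, m: int, a: int) -> int:
    """M = m * 2^(m + a) * S for the fixed-z realization of Lemma 1."""
    return m * 2 ** (m + a) * stages(n, m, a)


if __name__ == "__main__":
    n, m = 8, 2                        # 8-bit x, z = 3 needs 2 bits
    for a in range(1, n - m + 1):
        print(f"a={a}  S={stages(n, m, a)}  M={memory_bits(n, m, a)} bits")
```

For these parameters, a = 1 and a = 2 both give M = 96 bits, while a = 6 collapses the cascade to a single table of 512 bits, the same figure quoted for the single lookup table.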

We can obtain a similar lemma for the case where z is a
separate input by observing that z applies to all stages, and
each stage must produce an output that depends on z. This
increases by m the number of inputs, so that each unit stores
$m 2^{2m+a}$ instead of $m 2^{m+a}$ bits. Thus,


Lemma 2. In the realization of x mod z where z is an
independent input, let n and m be the number of bits to
represent x and z, respectively. Let S be the number of stages
in the pipeline, and let a be the speed-up over the full-latency
system, where a = 1 (no speed-up), 2, 3, etc. Let M be the
total memory in bits needed in this realization. Then,

$$S = \left\lceil \frac{n-m}{a} \right\rceil \quad \text{and} \quad M = m 2^{2m+a} \left\lceil \frac{n-m}{a} \right\rceil. \qquad (3)$$


IV.  Experimental  Results 

A Verilog program was written for the SRC-6 reconfigurable
computer that computes (pseudo) random numbers using
Lehmer's algorithm [10]

$$s_{i+1} = \gamma s_i \bmod z, \qquad (4)$$

where the values $\gamma = 16807 = 7^5$ and $z = 2^{31} - 1$ are
suggested in [14]. This same expression was also implemented
by rand, the uniform random number generator in MATLAB
Version 4 [11]. Equation (4) is a full-period generating function, where
$\gamma s_i$ and z can be represented in 46 and 31 bits, respectively.
In our circuit, x mod z is realized using the architecture
in which z is fixed (Fig. 1). A total of 16 stages (16 = 46 −
31 + 1) are needed. The delay of each stage is less than 5
ns. Therefore, two stages can be cascaded between pipeline
registers, since the delay of these two stages is less than
one 100 MHz clock period (10 ns). As a result, x mod z is
realized in eight clock periods. The multiplication $\gamma s_i$ requires
an additional stage. Thus, nine clock periods are required to
generate each random number. The random number generator
is realized as a producer in a producer-consumer stream. In
all, it takes 36,900 clocks to generate 4,096 random numbers,
or 9 clocks per random number plus 36 clocks for overhead.
With a clock running at 100 MHz and 9 clocks per random
number, this random number generator produces more than 11
million random numbers per second.
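The generator described above can be mimicked in software to check the stage count and the first few outputs. The Python sketch below is a behavioral model, not the Verilog used on the SRC-6; the names `MULTIPLIER`, `MODULUS`, `mod_by_stages`, and `lehmer` are assumptions introduced here. The reduction uses the same subtract-if-possible stages as the fixed-z architecture, applied to the 46-bit product.

```python
MULTIPLIER = 16807          # 7**5, the multiplier suggested in [14]
MODULUS = 2**31 - 1         # the 31-bit modulus suggested in [14]


def mod_by_stages(x: int, z: int, n: int) -> int:
    """Reduce an n-bit x modulo z with n - m + 1 subtract-if-possible stages."""
    m = z.bit_length()
    for i in range(n - m, -1, -1):     # 16 stages when n = 46 and m = 31
        if x >= (z << i):
            x -= z << i
    return x


def lehmer(seed: int, count: int) -> list[int]:
    """Return `count` successive values of s <- MULTIPLIER * s mod MODULUS."""
    s, out = seed, []
    for _ in range(count):
        s = mod_by_stages(MULTIPLIER * s, MODULUS, 46)
        out.append(s)
    return out


if __name__ == "__main__":
    values = lehmer(1, 4096)
    # The first two values are 16807 and 16807**2, both below the modulus.
    assert values[:2] == [16807, 16807**2]
    # Clock budget quoted in the text: 9 clocks per number plus 36 overhead.
    assert 4096 * 9 + 36 == 36900
    print(values[:3])
```

At 9 clocks per number and a 100 MHz clock, the rate is 100,000,000 / 9, or about 11.1 million numbers per second, in line with the figure stated above.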

Fig. 5 shows the first 128 random numbers. Here, a black
box represents 1 and a white box represents 0. The first
number, $s_0$, the seed, is at the left. For illustrative purposes,
we chose $s_0 = 1$ (the single 1 at the least significant bit
position is at the bottom). The next two numbers are the
binary representation of $s_1 = 16807^1$ and $s_2 = 16807^2$,
both of which are unchanged by the mod $2^{31} - 1$ operation.
The properties of this random number generator have been
extensively studied, and it is denoted as the minimal standard
generator in [14]. Even with the non-random choice of the
seed $s_0 = 1$, Fig. 5 supports the statement in [14] that the
minimal standard generator is "demonstrably random".


Fig. 5. Sequence of random numbers generated by the Lehmer algorithm
(horizontal axis: time, 0 to 120).

Table  V  shows  the  resources  used.  Only  a  fraction  of  one 
FPGA’s  resources  are  needed.  The  FPGA,  in  this  case,  is 
the  Xilinx  Virtex2p  XC2VP100  FPGA  with  Package  FF1696 
and  Speed  Grade  -5.  Our  design  met  the  100  MHz  timing 
constraint  imposed  by  the  SRC-6  Carte  toolchain  by  a  slight 
margin  (100.4  MHz). 

TABLE V
Resources on a Xilinx Virtex2p XC2VP100 FPGA used to implement the
Lehmer random number generator on the SRC-6.

  Number of           Used/Available     Percentage
  Slice Flip-Flops    2,516/88,192       2%
  4-Input LUTs        2,794/88,192       3%
  Occupied Slices     2,557/44,096       5%

V. Concluding Remarks

We show fast and compact circuits that realize x mod z. In
one version, z is fixed; its value is determined by the hardware.
In another, z is an independent input. Our experimental results
show that the complexity of the latter is only slightly larger
than that of the former, while the speeds are nearly the same.

We illustrate the implementation of these circuits in the
generation of random numbers using Lehmer's algorithm on
the SRC-6 reconfigurable computer. With a clock speed of
100 MHz, we are able to produce random numbers at a rate
of more than 11 million per second.